These exercises are meant to show how to conceptually approach your data analysis but there are many more and different ways to explore your data. The most important thing to keep in mind is that you have to understand your own data and analyses. One way to achieve this is to perform visual explorations that help you to judge whether the data are appropriate for your question.
Now let’s start the fun!
In R studio.
# dplyr: A grammar of data manipulation, providing a consistent set of verbs to solve common data manipulation challenges
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# tibble: Provides a modern re-imagining of data frames, making them more user-friendly
library(tibble)
# tidyverse: A collection of R packages designed for data science, all sharing an underlying design philosophy, grammar, and data structures
library(tidyverse)## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.1 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: phyloseq
##
## microbiome R package (microbiome.github.com)
##
##
##
## Copyright (C) 2011-2022 Leo Lahti,
## Sudarshan Shetty et al. <microbiome.github.io>
##
## Attaching package: 'microbiome'
## The following object is masked from 'package:ggplot2':
##
## alpha
## The following object is masked from 'package:base':
##
## transform
# microbiome utilities: Extends functionalities of the microbiome package with additional utilities
library(microbiomeutilities)## Warning: replacing previous import 'ggplot2::alpha' by 'microbiome::alpha' when
## loading 'microbiomeutilities'
##
## Attaching package: 'microbiomeutilities'
## The following object is masked from 'package:microbiome':
##
## add_refseq
# ggplot2: A system for creating elegant and versatile data visualizations based on the grammar of graphics
library(ggplot2)
# hrbrthemes: Contains additional themes, theme components, and utilities for ggplot2
library(hrbrthemes)
# RColorBrewer: Provides color palettes for visualizing data, particularly useful in ggplot2 visualizations
library(RColorBrewer)First we need to load our data. Usually the biggest bottleneck between raw data and analyses is to get the data in the right shape for your purpose. Often this requires a little bit of data mingling. On this road - Google is your best friend to master the R universe :)
Let’s first load the relative abundance table.
For this part we are using the phyloseq package. The phyloseq package is a tool to import, store, analyze, and graphically display complex phylogenetic sequencing data that has already been clustered into Operational Taxonomic Units (OTUs), especially when there is associated sample data, phylogenetic tree, and/or taxonomic assignment of the OTUs.
First st your working directory:
And load your two files:
merged_biom_samples <- phyloseq::import_biom("../data/fastq/kraken/bracken/merge_species.biom")
merged_metagenomes <- merged_biom_samples
meta <- readxl::read_excel("../data/Glossina_metadata.xlsx")Any data already in an R session can be annotated/coerced to be recognized by phyloseq’s functions and methods. This is important, because there are lots of ways you might receive data related to a microbiome project, and not all of these will come from a popular server or workflow that is already supported by a phyloseq import function.
Let’s look at the phlyseq object.
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 91 taxa and 78 samples ]
## sample_data() Sample Data: [ 78 samples by 1 sample variables ]
## tax_table() Taxonomy Table: [ 91 taxa by 7 taxonomic ranks ]
# Summarize the phyloseq object 'merged_biom_samples'
#microbiome::summarize_phyloseq(merged_biom_samples)
# Display the first few rows of the OTU (Operational Taxonomic Unit) table
head(otu_table(merged_biom_samples))## OTU Table: [6 taxa and 78 samples]
## taxa are rows
## GI-103_S80.kraken_report_bracken_genuses
## 1688 12358
## 1686 133
## 1696 94
## 1762 359
## 1766 80
## 1764 55
## GI-59_S79.kraken_report_bracken_genuses
## 1688 2367
## 1686 72
## 1696 214
## 1762 200
## 1766 0
## 1764 0
## Gl-104_S42.kraken_report_bracken_genuses
## 1688 6272
## 1686 46
## 1696 59
## 1762 186
## 1766 38
## 1764 26
## Gl-111_S43.kraken_report_bracken_genuses
## 1688 805
## 1686 0
## 1696 0
## 1762 38
## 1766 0
## 1764 0
## Gl-121_S45.kraken_report_bracken_genuses
## 1688 119
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-124_S46.kraken_report_bracken_genuses
## 1688 48
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-125_S47.kraken_report_bracken_genuses
## 1688 1170
## 1686 69
## 1696 134
## 1762 125
## 1766 0
## 1764 0
## Gl-14_S2.kraken_report_bracken_genuses
## 1688 416
## 1686 0
## 1696 0
## 1762 46
## 1766 0
## 1764 0
## Gl-15_S3.kraken_report_bracken_genuses
## 1688 121
## 1686 0
## 1696 0
## 1762 53
## 1766 0
## 1764 0
## Gl-16_S4.kraken_report_bracken_genuses
## 1688 298
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-17_S5.kraken_report_bracken_genuses
## 1688 55
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-25_S48.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-26_S6.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-27_S7.kraken_report_bracken_genuses
## 1688 177
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-28_S8.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-29_S9.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-30_S10.kraken_report_bracken_genuses
## 1688 59
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-31_S11.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-32_S12.kraken_report_bracken_genuses
## 1688 51
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-33_S13.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-34_S14.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-36_S15.kraken_report_bracken_genuses
## 1688 163
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-38_S16.kraken_report_bracken_genuses
## 1688 4437
## 1686 53
## 1696 41
## 1762 124
## 1766 0
## 1764 0
## Gl-39_S17.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-45_S18.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-47_S19.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-51_S20.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-52_S21.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-53_S22.kraken_report_bracken_genuses
## 1688 167
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-54_S23.kraken_report_bracken_genuses
## 1688 452
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-56_S24.kraken_report_bracken_genuses
## 1688 7069
## 1686 80
## 1696 68
## 1762 186
## 1766 52
## 1764 54
## Gl-61_S25.kraken_report_bracken_genuses
## 1688 99
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-62_S26.kraken_report_bracken_genuses
## 1688 6860
## 1686 217
## 1696 135
## 1762 74
## 1766 0
## 1764 0
## Gl-63_S27.kraken_report_bracken_genuses
## 1688 207
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-64_S28.kraken_report_bracken_genuses
## 1688 112
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-66_S29.kraken_report_bracken_genuses
## 1688 3538
## 1686 196
## 1696 647
## 1762 1205
## 1766 0
## 1764 146
## Gl-67_S30.kraken_report_bracken_genuses
## 1688 819
## 1686 22
## 1696 36
## 1762 252
## 1766 0
## 1764 0
## Gl-69_S31.kraken_report_bracken_genuses
## 1688 5923
## 1686 61
## 1696 37
## 1762 145
## 1766 29
## 1764 0
## Gl-70_S32.kraken_report_bracken_genuses
## 1688 970
## 1686 0
## 1696 42
## 1762 102
## 1766 0
## 1764 0
## Gl-71_S33.kraken_report_bracken_genuses
## 1688 3971
## 1686 2663
## 1696 214
## 1762 0
## 1766 0
## 1764 0
## Gl-72_S34.kraken_report_bracken_genuses
## 1688 86
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-73_S35.kraken_report_bracken_genuses
## 1688 112
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-75_S36.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-79_S37.kraken_report_bracken_genuses
## 1688 632
## 1686 39
## 1696 70
## 1762 0
## 1766 0
## 1764 0
## Gl-7_S1.kraken_report_bracken_genuses
## 1688 79
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-80_S38.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-85_S39.kraken_report_bracken_genuses
## 1688 87
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-86_S40.kraken_report_bracken_genuses
## 1688 165
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## Gl-88_S41.kraken_report_bracken_genuses
## 1688 8020
## 1686 333
## 1696 303
## 1762 88
## 1766 0
## 1764 73
## I10_S77.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## I111_S52.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## I113_S59.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## I117_S53.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## I11_S78.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## I13_S67.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## I14_S68.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## I15_S69.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## I16_S70.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## I17_S71.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## I19_S73.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## I24_S49.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## I28_S74.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## I29_S75.kraken_report_bracken_genuses I2_S60.kraken_report_bracken_genuses
## 1688 0 0
## 1686 0 0
## 1696 0 0
## 1762 0 0
## 1766 0 0
## 1764 0 0
## I35_S50.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## I36_S51.kraken_report_bracken_genuses I3_S61.kraken_report_bracken_genuses
## 1688 0 0
## 1686 0 0
## 1696 0 0
## 1762 0 0
## 1766 0 0
## 1764 0 0
## I4_S62.kraken_report_bracken_genuses I50_S54.kraken_report_bracken_genuses
## 1688 0 0
## 1686 0 0
## 1696 0 0
## 1762 0 0
## 1766 0 0
## 1764 0 0
## I51_S55.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## I52_S56.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## I56_S57.kraken_report_bracken_genuses
## 1688 0
## 1686 0
## 1696 0
## 1762 0
## 1766 0
## 1764 0
## I57_S58.kraken_report_bracken_genuses I5_S63.kraken_report_bracken_genuses
## 1688 0 0
## 1686 0 0
## 1696 0 0
## 1762 0 0
## 1766 0 0
## 1764 0 0
## I6_S64.kraken_report_bracken_genuses I7_S65.kraken_report_bracken_genuses
## 1688 56 0
## 1686 0 0
## 1696 0 0
## 1762 0 0
## 1766 0 0
## 1764 0 0
## I8_S66.kraken_report_bracken_genuses I9_S76.kraken_report_bracken_genuses
## 1688 0 0
## 1686 0 0
## 1696 0 0
## 1762 0 0
## 1766 0 0
## 1764 0 0
## Taxonomy Table: [6 taxa by 7 taxonomic ranks]:
## Rank1 Rank2 Rank3 Rank4
## 1688 "k__Bacteria" "p__Firmicutes" "c__Bacilli" "o__Bacillales"
## 1686 "k__Bacteria" "p__Firmicutes" "c__Bacilli" "o__Bacillales"
## 1696 "k__Bacteria" "p__Firmicutes" "c__Bacilli" "o__Bacillales"
## 1762 "k__Bacteria" "p__Firmicutes" "c__Bacilli" "o__Bacillales"
## 1766 "k__Bacteria" "p__Firmicutes" "c__Bacilli" "o__Bacillales"
## 1764 "k__Bacteria" "p__Firmicutes" "c__Bacilli" "o__Bacillales"
## Rank5 Rank6 Rank7
## 1688 "f__Bacillaceae" "g__Bacillus" "s__"
## 1686 "f__Bacillaceae" "g__Anoxybacillus" "s__"
## 1696 "f__Bacillaceae" "g__Geobacillus" "s__"
## 1762 "f__Planococcaceae" "g__Lysinibacillus" "s__"
## 1766 "f__Planococcaceae" "g__Rummeliibacillus" "s__"
## 1764 "f__Planococcaceae" "g__Planococcus" "s__"
# Display the first few rows of the sample data associated with 'merged_biom_samples'
#head(sample_data(merged_biom_samples))
# Get the sample variables of the phyloseq object 'merged_biom_samples'
#sample_variables(merged_biom_samples)We need to add the metadata to the phyloseq object.
# Set the new column names in the phyloseq object
sample_names(merged_metagenomes) <- gsub("_.*$|\\.kraken_report_bracken_genuses$", "", sample_names(merged_metagenomes))
sample_names(merged_metagenomes) <- gsub("^Gl", "GI", sample_names(merged_metagenomes))
# Sort the 'meta' data frame by the 'SRA.identifier' column
meta$Sample <- gsub("_.*$", "", meta$Sample)
meta$Samples <- meta$Sample
meta <- meta %>% column_to_rownames(var = "Samples")
# meta$Sample == sample_names(merged_metagenomes)
# Associate the sorted metadata to the phyloseq object as sample data
merged_metagenomes@sam_data <- sample_data(meta)
# Remove the unnecessary 'k_' prefix in the taxonomy data
merged_metagenomes@tax_table@.Data <- substring(merged_metagenomes@tax_table@.Data, 4)
# Rename the columns of the taxonomy table to represent taxonomic ranks
colnames(merged_metagenomes@tax_table@.Data) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")Before we start anything, let’s just check out or data a little bit (sanity check). Never go blind into your analyses.
## Warning in psmelt(merged_metagenomes): The sample variables:
## Sample
## have been renamed to:
## sample_Sample
## to avoid conflicts with special phyloseq plot attribute names.
## OTU Sample Abundance sample_Sample spp sex
## 6382 46521 GI-125 45702 GI-125 Glossina palpalis gambiensis F
## 6921 49957 I117 43682 I117 Glossina palpalis gambiensis F
## 6937 49957 I113 37452 I113 Glossina palpalis gambiensis F
## 6384 46521 I13 32991 I13 Glossina palpalis gambiensis M
## 6319 46521 I8 30915 I8 Glossina palpalis gambiensis F
## 6336 46521 I6 28814 I6 Glossina palpalis gambiensis F
## Ecological.gradient Caracteristic.Gradient infection
## 6382 farmed land rice, palm trees at mangrove perif infected
## 6921 wild mangrove infected
## 6937 wild mangrove infected
## 6384 wild mangrove not infected
## 6319 wild mangrove not infected
## 6336 wild mangrove not infected
## Kingdom Phylum Class Order
## 6382 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales
## 6921 Eukaryota Euglenozoa Kinetoplastea Trypanosomatida
## 6937 Eukaryota Euglenozoa Kinetoplastea Trypanosomatida
## 6384 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales
## 6319 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales
## 6336 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales
## Family Genus Species
## 6382 Morganellaceae Wigglesworthia
## 6921 Trypanosoma
## 6937 Trypanosoma
## 6384 Morganellaceae Wigglesworthia
## 6319 Morganellaceae Wigglesworthia
## 6336 Morganellaceae Wigglesworthia
Q: How many species and samples are detected in our data set?
ntaxa(merged_metagenomes), nsamples.
Q: Look at the taxonomy is there some problem with the taxonomy?
We have contaminant in the data. Which one? (head(psmelt(merged_metagenomes)))
Q: what are the taxa with the more abundant reads?
head(otu_table(merged_metagenomes))
Extraction of the sample’s names.
## [1] "GI-103" "GI-59" "GI-104" "GI-111" "GI-121" "GI-124" "GI-125" "GI-14"
## [9] "GI-15" "GI-16" "GI-17" "GI-25" "GI-26" "GI-27" "GI-28" "GI-29"
## [17] "GI-30" "GI-31" "GI-32" "GI-33" "GI-34" "GI-36" "GI-38" "GI-39"
## [25] "GI-45" "GI-47" "GI-51" "GI-52" "GI-53" "GI-54" "GI-56" "GI-61"
## [33] "GI-62" "GI-63" "GI-64" "GI-66" "GI-67" "GI-69" "GI-70" "GI-71"
## [41] "GI-72" "GI-73" "GI-75" "GI-79" "GI-7" "GI-80" "GI-85" "GI-86"
## [49] "GI-88" "I10" "I111" "I113" "I117" "I11" "I13" "I14"
## [57] "I15" "I16" "I17" "I19" "I24" "I28" "I29" "I2"
## [65] "I35" "I36" "I3" "I4" "I50" "I51" "I52" "I56"
## [73] "I57" "I5" "I6" "I7" "I8" "I9"
Microbial species can be called at multiple taxonomic resolutions. We can easily agglomerate the data based on taxonomic ranks. Here, we agglomerate the data at Genus level.
level <- "Genus"
merged_metagenomes_taxfull <- merged_metagenomes
# Aggregate rare taxa at the family level for the phyloseq object 'merged_metagenomes'
merged_metagenomes <- aggregate_rare(merged_metagenomes, level = level, detection = 0/100, prevalence = 0/100)
tax_table_df <- as.data.frame(merged_metagenomes@tax_table)
otus_to_keep <- rownames(tax_table_df[tax_table_df$Genus != "Other", ])
merged_metagenomes <- prune_taxa(otus_to_keep, merged_metagenomes)
# Display the dimensionality of the abundances of 'merged_metagenomes_family'
dim(abundances(merged_metagenomes))## [1] 89 78
Q: How many sample and tax do we have now?
There are XX taxa and 80 samples, meaning that there are XX different Phylum level taxonomic groups. Looking at the rowData after agglomeration shows all Enterococcaceae are combined together, and all lower rank information is lost.
## Taxonomy Table: [6 taxa by 2 taxonomic ranks]:
## Genus
## Acinetobacter "Acinetobacter"
## Agarivorans "Agarivorans"
## Agromyces "Agromyces"
## Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium "Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium"
## Altererythrobacter "Altererythrobacter"
## Anoxybacillus "Anoxybacillus"
## unique
## Acinetobacter "Acinetobacter"
## Agarivorans "Agarivorans"
## Agromyces "Agromyces"
## Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium "Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium"
## Altererythrobacter "Altererythrobacter"
## Anoxybacillus "Anoxybacillus"
Now that we know a little bit about our data we can start the pre-processing.
Let us check for distribution of number of sequences retained from the Kraken/Bracken approach.
You can try to plot with different metadata.
Plotting the read count per sample
df <- psmelt(merged_metagenomes) %>% group_by(Sample, sex) %>%
summarise(sum_reads = sum(Abundance)) %>% arrange(sum_reads) ## Warning in psmelt(merged_metagenomes): The sample variables:
## Sample
## have been renamed to:
## sample_Sample
## to avoid conflicts with special phyloseq plot attribute names.
## `summarise()` has grouped output by 'Sample'. You can override using the
## `.groups` argument.
ggplot(df) +
geom_bar(aes(reorder(Sample, -sum_reads), sum_reads, fill=sex),
col="red", alpha = .7, stat="identity")
Q: What do observe, what will be the consequences
Samples might be contaminated with exogenous sequences. We have observed 1 contaminants Homo sapiens.
## Warning in psmelt(subset_taxa(merged_metagenomes, Genus == "Trypanosoma")): The sample variables:
## Sample
## have been renamed to:
## sample_Sample
## to avoid conflicts with special phyloseq plot attribute names.
## OTU Sample Abundance sample_Sample spp sex
## 54 Trypanosoma I117 43682 I117 Glossina palpalis gambiensis F
## 53 Trypanosoma I113 37452 I113 Glossina palpalis gambiensis F
## 31 Trypanosoma GI-59 27861 GI-59 Glossina palpalis gambiensis F
## 47 Trypanosoma GI-85 23509 GI-85 Glossina palpalis gambiensis F
## 44 Trypanosoma GI-75 15734 GI-75 Glossina palpalis gambiensis F
## 66 Trypanosoma I35 13337 I35 Glossina palpalis gambiensis F
## Ecological.gradient Caracteristic.Gradient infection Genus
## 54 wild mangrove infected Trypanosoma
## 53 wild mangrove infected Trypanosoma
## 31 ecotone mangrove periphery infected Trypanosoma
## 47 ecotone mangrove periphery infected Trypanosoma
## 44 farmed land rice, palm trees at mangrove perif infected Trypanosoma
## 66 wild mangrove infected Trypanosoma
## unique
## 54 Trypanosoma
## 53 Trypanosoma
## 31 Trypanosoma
## 47 Trypanosoma
## 44 Trypanosoma
## 66 Trypanosoma
#Keep only the kingdom of interest
merged_metagenomes <- subset_taxa(merged_metagenomes_taxfull, Kingdom =="Bacteria")
merged_metagenomes <- aggregate_rare(merged_metagenomes, level = level, detection = 0/100, prevalence = 0/100)
tax_table_df <- as.data.frame(merged_metagenomes@tax_table)
otus_to_keep <- rownames(tax_table_df[tax_table_df$Genus != "Other", ])
merged_metagenomes <- prune_taxa(otus_to_keep, merged_metagenomes)Microbial abundances are typically ‘compositional’ (relative) in the current microbiome profiling data sets. This is due to technical aspects of the data generation process (see e.g. Gloor et al., 2017).
The next example calculates relative abundances as these are usually easier to interpret than plain counts. For some statistical models we need to transform the data into other formats as explained in above link (and as we will see later).
detection: Detection threshold for absence/presence (percentage reads). prevalence: Prevalence threshold (in [0, 1]) (presence across samples).
Q: what is the meaning and effect of compositional?
abundances(pseq)
Bar plot are useful to have a broad look at the data.
pseq <- microbiome::transform(merged_metagenomes, transform = "compositional")
# Create a plot of the taxonomic composition at the level selected
p <- plot_composition(pseq,
taxonomic.level = "Genus", # Specify the taxonomic level for the plot
sample.sort = "Sample", # Sort samples by sample identifier
x.label = "Sample", # Label for the x-axis
group_by = "sex") + # Group samples by the 'Time' variable
# Use a color palette from the Color Brewer for filling the Family groups
#scale_fill_brewer("Species", palette = "magma-viridis") +
# Arrange the legend items in a single column
guides(fill = guide_legend(ncol = 1)) +
# Convert the y-axis to represent relative abundance as a percentage
scale_y_percent() +
# Add labels and a title to the plot
labs(x = "Samples", y = "Relative abundance (%)",
title = "Relative abundance data - Full dataset") +
# Apply a clean theme with a horizontal grid
theme_ipsum(grid = "Y") +
# Customize the theme for x-axis text and legend text
theme(axis.text.x = element_text(angle = 90, hjust = 1), # Rotate x-axis text 90 degrees for readability
legend.text = element_text(face = "italic"), # Italicize the legend text
legend.position = "right")
# Print the plot
print(p)Q: Does the profiles look similar between samples? Can you spot any trends?
A: We see that the same one or two OTU dominates all samples but there is some variability between samples. And ?
If you have time you can also visualize the other taxonomic levels (e.g. species) with the same approach. Try to come up with the code yourself.
n <- dim(tax_table(pseq))[1]
qual_col_pals = brewer.pal.info[brewer.pal.info$category == 'qual',]
col_vector = unlist(mapply(brewer.pal, qual_col_pals$maxcolors, rownames(qual_col_pals)))
col_vector <- rep(col_vector, length.out = 84)
# Create a plot of the taxonomic composition at the Genus level
p <- plot_composition(pseq,
taxonomic.level = "Phylum", # Specify the taxonomic level for the plot
sample.sort = "Sample", # Sort samples by sample identifier
x.label = "Sample", # Label for the x-axis
otu.sort = "abundance", # Sort OTUs by their abundance
group_by = "sex") + # Group samples by the 'Type' variable
# Use a color palette from the Color Brewer for filling the Genus groups
scale_fill_manual(values = col_vector) +
# Arrange the legend items in a single column
guides(fill = guide_legend(ncol = 1)) +
# Convert the y-axis to represent relative abundance as a percentage
scale_y_percent() +
# Add labels and a title to the plot
labs(x = "Samples", y = "Relative abundance (%)",
title = "Relative abundance data") +
# Apply a clean theme with a horizontal grid
theme_ipsum(grid = "Y") +
# Customize the theme for x-axis text and legend text
theme(axis.text.x = element_text(angle = 90, hjust = 1), # Rotate x-axis text 90 degrees for readability
legend.text = element_text(face = "italic")) # Italicize the legend text
# Print the plot
print(p)What do you Observe? Is it correct? What action should be taken?
Before normalization by sub-sampling, let’s have a look at rarefaction curves, evaluate your sequencing effort and make decisions. Rarefaction is a statistical technique employed to evaluate species richness based on sampling results. The process involved sub-sampling reads without replacement to a defined sequencing depth, thereby creating a standardized library size across samples. Any sample with a total read count less than the defined sequencing depth used to rarefy will be discarded. What we want to see in those curves is that it reaches a plateau meaning no new OTUs are discovered (going up on the Y axis) as sequencing is deeper (the X axis). If this happens we can be pretty confident that we sequenced all the OTUs that were present in the sample.
# Define the list of samples to remove
samples_to_remove <- c("GI-103", "GI-59", "GI-104", "GI-111", "GI-121",
"GI-124", "GI-125", "GI-14", "GI-15", "GI-16",
"GI-17", "GI-25", "GI-26", "GI-27", "GI-28",
"GI-29", "GI-30", "GI-31", "GI-32", "GI-33",
"GI-34", "GI-36", "GI-38", "GI-39", "GI-45", "GI-47", "GI-51", "I9", "I7")
# Prune these samples from your phyloseq object
physeq_cleaned <- prune_samples(!(sample_names(merged_metagenomes) %in% samples_to_remove), merged_metagenomes)
ranacapa::ggrare(physeq_cleaned, step = 10, color = "Sample", se = FALSE) +
geom_vline(xintercept = min(sample_sums(pseq)), color = "gray60")## rarefying sample GI-52
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 36
## rarefying sample GI-53
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 25
## rarefying sample GI-54
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 39
## rarefying sample GI-56
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 26
## rarefying sample GI-61
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 19
## rarefying sample GI-62
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 21
## rarefying sample GI-63
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 29
## rarefying sample GI-64
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 112
## rarefying sample GI-66
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 18
## rarefying sample GI-67
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 17
## rarefying sample GI-69
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 18
## rarefying sample GI-70
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 30
## rarefying sample GI-71
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 24
## rarefying sample GI-72
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 19
## rarefying sample GI-73
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 16
## rarefying sample GI-75
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 19
## rarefying sample GI-79
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 29
## rarefying sample GI-7
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 29
## rarefying sample GI-80
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 18
## rarefying sample GI-85
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 63
## rarefying sample GI-86
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 26
## rarefying sample GI-88
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 32
## rarefying sample I10
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 45
## rarefying sample I111
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 22
## rarefying sample I113
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 17
## rarefying sample I117
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 22
## rarefying sample I11
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 16
## rarefying sample I13
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 19
## rarefying sample I14
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 24
## rarefying sample I15
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 16
## rarefying sample I16
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 55
## rarefying sample I17
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 59
## rarefying sample I19
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 36
## rarefying sample I24
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 16
## rarefying sample I28
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 26
## rarefying sample I29
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 21
## rarefying sample I2
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 22
## rarefying sample I35
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 24
## rarefying sample I36
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 21
## rarefying sample I3
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 22
## rarefying sample I4
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 19
## rarefying sample I50
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 18
## rarefying sample I51
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 20
## rarefying sample I52
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 21
## rarefying sample I56
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 21
## rarefying sample I57
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 65
## rarefying sample I5
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 20
## rarefying sample I6
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 16
## rarefying sample I8
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 17
Q: What do you observe?
pseq <- microbiome::transform(physeq_cleaned, transform = "compositional")
n <- dim(tax_table(pseq))[1]
qual_col_pals = brewer.pal.info[brewer.pal.info$category == 'qual',]
col_vector = unlist(mapply(brewer.pal, qual_col_pals$maxcolors, rownames(qual_col_pals)))
col_vector <- rep(col_vector, length.out = 84)
# Create a plot of the taxonomic composition at the Genus level
p <- plot_composition(pseq,
taxonomic.level = "Phylum", # Specify the taxonomic level for the plot
sample.sort = "Sample", # Sort samples by sample identifier
x.label = "Sample", # Label for the x-axis
otu.sort = "abundance", # Sort OTUs by their abundance
group_by = "sex") + # Group samples by the 'Type' variable
# Use a color palette from the Color Brewer for filling the Genus groups
scale_fill_manual(values = col_vector) +
# Arrange the legend items in a single column
guides(fill = guide_legend(ncol = 1)) +
# Convert the y-axis to represent relative abundance as a percentage
scale_y_percent() +
# Add labels and a title to the plot
labs(x = "Samples", y = "Relative abundance (%)",
title = "Relative abundance - Reduced dataset") +
# Apply a clean theme with a horizontal grid
theme_ipsum(grid = "Y") +
# Customize the theme for x-axis text and legend text
theme(axis.text.x = element_text(angle = 90, hjust = 1), # Rotate x-axis text 90 degrees for readability
legend.text = element_text(face = "italic")) # Italicize the legend text
# Print the plot
print(p)